Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B
#13386
Conversation
Hi there! I tried this PR (applied the changes on top of the last commit, to use MLA + FA) on DeepSeek V3 0324, but I noticed lower PP performance. I think it doesn't saturate the PCI-E link for the main GPU when using the flag vs. without it, which results in lower PP t/s.

[Screenshots: the command used; RX/TX usage without the flag (GPU 0 gets saturated) during PP; RX/TX usage with the flag during PP]

So PP goes from 66 t/s without the flag to 26 t/s with it. Maybe there is an incompatible flag I'm using?
Using the flag should place considerably more load on the CPU, and whether that's beneficial at all is highly specific to the model, offload params, and hardware config. For now I have personally only tested Llama 4 and Qwen 3 with a relatively performant CPU (7970X) and a single relatively weak GPU (W7900), with a simple:

./build/bin/llama-bench -m ~/models/llama4-400b-hybrid-q8_0-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 --disable-op-offload 1,0 -n 0
./build/bin/llama-bench -m ~/models/qwen3-235b-a22b-q8_0-q4_0-hybrid.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 --disable-op-offload 1,0 -n 0
Ah, then that could be the reason. I have 192 GB of RAM on a Ryzen 7 7800X3D, which is a consumer CPU, so it is pretty weak for these tasks.
Could we not just parameterise the fixed
A setting to control the minimum batch size would need to be per-backend, and configured via an environment variable.

Ah, sorry, I forgot the
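A per-backend, environment-variable-driven minimum batch size as described above could be sketched roughly like this. This is a hypothetical illustration only: the variable name LLAMA_BACKEND_MIN_BATCH and the function backend_min_batch_size are made up for this example and are not actual llama.cpp code.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical: minimum batch size below which ops stay on the CPU instead
// of being offloaded to the backend. Reads an (invented) environment
// variable, falling back to the backend's compiled-in default when the
// variable is unset or malformed.
static int32_t backend_min_batch_size(int32_t default_min) {
    const char * env = std::getenv("LLAMA_BACKEND_MIN_BATCH");
    if (env == nullptr) {
        return default_min; // variable not set: use the built-in threshold
    }
    char * end = nullptr;
    long v = std::strtol(env, &end, 10);
    if (end == env || v <= 0) {
        return default_min; // ignore non-numeric or non-positive values
    }
    return (int32_t) v;
}
```

Each backend would call this once at initialisation with its own default, so different backends can carry different thresholds while users override them per-backend without recompiling.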
Title changed: "--disable-op-offload to improve -ot pp perf in MoE models like llama4 400B" → "--no-op-offload to improve -ot pp perf in MoE models like llama4 400B"
FYI: using without the flag


This change addresses issue #13241. It includes llama-bench support to help with performance tuning.
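Assuming the renamed flag keeps the same llama-bench toggle syntax as the commands earlier in the thread (a comma-separated on/off sweep), an A/B comparison might look like the following. The model path is just the example used above; this is a sketch of the invocation, not output from an actual run.

```shell
# Hypothetical A/B sweep: benchmark prompt processing with the flag
# enabled (1) and disabled (0) in a single llama-bench run.
./build/bin/llama-bench \
    -m ~/models/llama4-400b-hybrid-q8_0-q4_0.gguf \
    -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 \
    -ot 'exps=CPU' -mmp 0 \
    --no-op-offload 1,0 -n 0
```

Whether 1 or 0 wins depends on the CPU/GPU balance, as the thread above shows: a strong CPU with a weak GPU benefits, while a weak CPU feeding a fast PCI-E-saturating GPU can regress.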